Abstract - The lifting-scheme-based Discrete Wavelet
Transform (DWT) is a powerful tool for image processing
applications. Limited space for transmitting and storing
images drives the demand for high-speed implementations of
efficient compression techniques. This paper proposes a
highly pipelined and distributed VLSI architecture for the
lifting-based 2D DWT, with the lifting coefficients
represented in fixed-point [2:14] format. Compared to
conventional architectures [11], [13]-[16], the proposed
highly pipelined architecture significantly increases
processing speed: it raises the operating frequency at the
expense of more hardware area. A software model of the
proposed design was first developed in MATLAB. Based on this
software model, an efficient, highly parallel pipelined
architecture was designed in Verilog HDL and implemented on
a Virtex-6 (XC6VHX380T) FPGA. The design was also
synthesized with the TSMC 0.18 µm ASIC library using
Synopsys Design Compiler. The entire system is suitable for
several real-time applications.

Index Terms- ID DWT, Lifting, Parallel pipelining, FPGA, 
ASIC. 

I. Introduction 

Over the last two decades, the DWT has gained an established
role in signal and image processing applications because of
its ability to decompose a signal into different sub-bands
with both time and frequency information. The DWT also
offers features such as progressive image transmission, ease
of compressed-image manipulation, and region-of-interest
coding. Earlier, the DCT was used for image compression, but
it has several shortcomings, such as blocking artifacts and
poor subjective quality of images restored at high
compression ratios. The DWT has traditionally been
implemented by means of the Mallat filter bank scheme [2].
The DWT performs multi-resolution analysis, which localizes
signals in both the frequency and time domains. Its
full-frame nature de-correlates the image over a large scale
and removes the blocking artifacts at high compression
ratios. Earlier implementations of the wavelet transform
were based on convolution with the filters, an approach that
requires a huge amount of computational resources. The
lifting scheme was therefore adopted for implementing the
DWT. The lifting-scheme-based DWT has many characteristics
that suit VLSI hardware implementation. It significantly
reduces computational complexity and memory usage through
in-place computation of the wavelet coefficients. Compared
with the convolution algorithm, the lifting scheme offers
memory savings, computational efficiency, integer-to-integer
wavelet transforms [10], and symmetric forward and inverse
transforms. With the discovery of the CDF (9,7) filter bank
by Cohen, Daubechies and Feauveau [1], which has linear
phase and excellent image compression performance, the
application of the DWT has increased tremendously. Various
Very Large-Scale Integration (VLSI) architectures for the
DWT have been presented in the literature [3]-[7].

Several lifting-based DWT architectures have been
developed. Wu et al. [11] proposed a line-based scanning
scheme and a folded architecture. This architecture performs
multilevel DWT and has simple control circuits, but it uses
an external frame buffer. The recursive architecture [12]
eliminates the external buffer, but it has a complex control
circuit and requires more internal memory than the folded
structure. All these architectures work at a fixed
processing speed and cannot be extended to achieve a higher
one. Higher processing speed can be achieved with a parallel
FIR structure, at the expense of more hardware area.

The aim of this paper is to construct a highly pipelined
and distributed VLSI architecture for the CDF (9,7)
lifting-based 2D DWT that combines high processing speed,
simple control signals and a controlled increase in hardware
cost. High-speed processing is achieved using techniques
such as pipelining and distributed arithmetic. The paper
proposes an FDWT architecture for high throughput. The
lifting coefficients are represented in fixed-point Q[2:14]
format, which reduces the computational complexity and is
well suited to VLSI hardware implementation.

II. Lifting Scheme Based Discrete Wavelet Transform

The lifting-scheme-based discrete wavelet transform, also
called the second-generation wavelet, was introduced by
Sweldens [8], [9] and is constructed entirely in the spatial
domain.

A. Lifting Scheme Realization

The lifting algorithm is computed in three main phases: the
split phase, the lifting phase and the normalization phase,
as illustrated in Figure 1. The lifting scheme algorithm can
be described as follows:

- Split phase: The original signal, X(n), is split into even
and odd samples.

$X_e = X(2n)$  (1)
$X_o = X(2n+1)$  (2)

- Lifting phase: This phase is executed as N sub-steps,
where the odd and even samples are filtered by the
prediction and update filters, $P_n$ and $U_n$. The even
samples are multiplied by the predictor operator and used to
predict the odd samples. The difference between the odd
sample and the predicted value gives the detail coefficient.
The even samples are then updated with the detail
coefficients to obtain the smooth coefficients.

$Y(2n+1) = X_o(2n+1) + P_n(X_e)$  (3)
$Y(2n) = X_e(2n) + U_n(Y(2n+1))$  (4)

- Normalization or scaling step: After N lifting steps, the
scaling coefficients K and 1/K are applied to the even and
odd samples respectively, in order to obtain the low-pass
coefficients $Y_L(i)$ and the high-pass coefficients
$Y_H(i)$.

$Y_L(i) = K \cdot Y(2n)$  (5)
$Y_H(i) = \frac{1}{K} \cdot Y(2n+1)$  (6)





Figure 1. General diagram of the lifting process

Figure 2. CDF(9,7) lifting filter architecture

The CDF(9,7) wavelet filter is shown in Figure 2. The
architecture consists of two lifting stages, where α, β, γ
and δ are the four lifting coefficients and K is the scaling
factor or normalization constant. From [9], the lifting and
scaling coefficients are:
α = -1.586134342,
β = -0.0529801185,
γ = 0.882911076,
δ = 0.443506852,
and K = 1.149604398.
The lifting scheme algorithm for the (9,7) filter is applied
as follows.

Split step:

$X_e(i) \leftarrow X(2i)$  (even samples)
$X_o(i) \leftarrow X(2i+1)$  (odd samples)

Lifting steps:

For the (9,7) filter, the number of lifting stages is N = 2.
Predict P1: $D_1(i) = X_o(i) + \alpha\,[X_e(i) + X_e(i+1)]$
Update U1: $S_1(i) = X_e(i) + \beta\,[D_1(i-1) + D_1(i)]$
Predict P2: $D_2(i) = D_1(i) + \gamma\,[S_1(i) + S_1(i+1)]$
Update U2: $S_2(i) = S_1(i) + \delta\,[D_2(i-1) + D_2(i)]$

Scaling step:

$Y_H(i) = \frac{1}{K}\,D_2(i)$
$Y_L(i) = K\,S_2(i)$
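
The split, lifting and scaling steps above can be sketched in
software as follows. This is a minimal floating-point sketch,
not the hardware design: the function name lift_97_forward is
hypothetical, the input is assumed to have even length, and
the samples beyond the row ends are assumed to be mirrored
(the text does not specify boundary handling). δ is taken as
positive, which is the standard CDF(9,7) value and agrees with
the positive Q[2:14] entry 7266 in Table I.

```python
# Lifting and scaling constants quoted in the text.
ALPHA = -1.586134342
BETA = -0.0529801185
GAMMA = 0.882911076
DELTA = 0.443506852   # assumed positive (standard CDF 9/7 value)
K = 1.149604398


def lift_97_forward(x):
    """One level of the forward CDF(9,7) lifting transform.

    Returns (low, high): the scaled (low-pass) and detail (high-pass)
    coefficients. Boundary samples are mirrored, which is an assumption.
    """
    if len(x) < 2 or len(x) % 2:
        raise ValueError("expects an even-length input with at least 2 samples")

    # Split step: even and odd samples.
    xe = [float(v) for v in x[0::2]]
    xo = [float(v) for v in x[1::2]]
    n = len(xe)

    def nxt(seq, i):
        # seq(i+1), mirrored at the right edge
        return seq[min(i + 1, n - 1)]

    def prv(seq, i):
        # seq(i-1), mirrored at the left edge
        return seq[max(i - 1, 0)]

    # Lifting steps P1, U1, P2, U2.
    d1 = [xo[i] + ALPHA * (xe[i] + nxt(xe, i)) for i in range(n)]
    s1 = [xe[i] + BETA * (prv(d1, i) + d1[i]) for i in range(n)]
    d2 = [d1[i] + GAMMA * (s1[i] + nxt(s1, i)) for i in range(n)]
    s2 = [s1[i] + DELTA * (prv(d2, i) + d2[i]) for i in range(n)]

    # Scaling step.
    low = [K * v for v in s2]
    high = [v / K for v in d2]
    return low, high
```

A quick sanity check on such a sketch is a constant input: the
detail coefficients should be close to zero, since a constant
signal carries no high-frequency content.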

B. Two-Dimensional Discrete Wavelet Transform 

The basic idea of the 2-D architecture is similar to that of
the 1-D architecture. A 2-D DWT can be seen as a 1-D wavelet
transform along the rows followed by a 1-D wavelet transform
along the columns. The 2-D DWT operates in a straightforward
manner by inserting an array transposition between the two
1-D DWTs. The rows of the array are processed first with
only one level of decomposition. This essentially divides
the array into two vertical halves, with the first half
storing the average coefficients and the second half storing
the detail coefficients. The process is then repeated on the
columns, resulting in four sub-bands within the array
defined by the filter output. The LL sub-band represents an
approximation of the original image. The other three
sub-bands, HL, LH and HH, contain higher-frequency detail
information (mostly local discontinuities at the edges in
the image). This process is repeated for as many levels of
decomposition as are desired.
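
The row pass, transposition and column pass described above
can be sketched as follows. To keep the sketch short, a Haar
lifting step stands in for the CDF(9,7) kernel; this is a
swapped-in filter for illustration, not the one used in this
paper. The function names and the placement of the HL and LH
labels are assumptions (naming conventions for these two
sub-bands vary).

```python
def haar_lift(row):
    """Stand-in 1-D transform (Haar lifting).

    Returns the low-pass half followed by the high-pass half.
    """
    even, odd = row[0::2], row[1::2]
    detail = [o - e for e, o in zip(even, odd)]         # predict
    smooth = [e + d / 2 for e, d in zip(even, detail)]  # update (pair average)
    return smooth + detail


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def dwt2d_one_level(image, transform_1d=haar_lift):
    """One decomposition level: 1-D transform on rows, transpose, 1-D again."""
    rows_done = [transform_1d(r) for r in image]                 # each row: [L | H]
    cols_done = [transform_1d(c) for c in transpose(rows_done)]  # column pass
    out = transpose(cols_done)                                   # back to row order
    h, w = len(out) // 2, len(out[0]) // 2
    return {
        "LL": [r[:w] for r in out[:h]],   # approximation
        "HL": [r[w:] for r in out[:h]],   # assumed label: top-right quadrant
        "LH": [r[:w] for r in out[h:]],   # assumed label: bottom-left quadrant
        "HH": [r[w:] for r in out[h:]],
    }
```

Further decomposition levels would call dwt2d_one_level again
on the returned LL sub-band.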

III. Hardware Implementation

The 2D DWT has a fundamental role in compression
algorithms. However, because of the complexity of its
hardware implementation, a significant number of studies
have been devoted to the design of architectures that
effectively utilize the available resources.

A. Proposed 2D DWT System Architecture

In this work a highly pipelined lifting-scheme-based 2D
FDWT has been implemented. The design is accelerated through
parallel processing of independent sub-modules. Figure 3
shows the proposed highly pipelined lifting-based 2D DWT
architecture.

The proposed architecture consists of the following units:
stage 1 DWT processor, stage 2 DWT processor, DWT
controller, tile memory, even memory, odd memory, combined
memory, transpose memory and two scaled and detail
memories. The DWT controller controls the overall process
and is designed as a finite state machine with simple
control signals. All the memory modules used in this design
are dual-ported RAMs (DPRAM), which allow multiple reads or
writes to occur at the same time. Each of these memories has
two banks (bank 1 and bank 2). A common clock and reset are
supplied to every module to keep the design synchronous.

Figure 3. Proposed 2D DWT architecture

The data flow between the modules is explained in detail in
this section. Initially the input image coefficients are
stored linearly in the external memory. The controller reads
the external memory tile-wise and writes to bank 1 of the
tile memory. After the first tile has been written into the
tile memory, the controller takes the tile data row-wise,
splits each row into even and odd image coefficients, and
stores them in bank 1 of the even and odd memories. The
controller then feeds these coefficients to the stage 1 DWT
processor and writes the transformed coefficients, i.e. the
scaled and detail coefficients, to bank 1 of the first
scaled and detail memories. Next, the controller feeds the
stage 1 transformed coefficients to the stage 2 DWT
processor and writes the stage 2 transformed coefficients,
again scaled and detail coefficients, to bank 1 of the
second scaled and detail memories. This process is repeated
for each row of the tile, and the results for consecutive
rows are saved in bank 1 of the second scaled and detail
memories. These consecutive stage 1 and stage 2 operations
on the tile yield the row-wise decomposed coefficients.

After the row-wise decomposition, the scaled and detail
coefficients are assembled for column-wise decomposition.
Once row processing is complete on the first tile, the
results of the stage 2 DWT processor are combined to form a
2D array, which is stored in bank 1 of the combined memory.
For column processing the combined 2D array is transposed
and stored in bank 1 of the transpose memory. From the
transpose memory the data are again split into even and odd
coefficients and stored in bank 1 of the even and odd
memories, and stage 1 and stage 2 processing is repeated
until column processing is done on the first tile.

While column processing of the first tile is under way, row
processing of the next tile is done in parallel. The next
tile is written into bank 2 of the tile memory. The
controller takes this data row-wise, splits it into even and
odd coefficients, and stores them in bank 2 of the even and
odd memories. These even and odd coefficients are the input
to the stage 1 DWT processor, whose results are stored in
bank 2 of the first scaled and detail memories. Stage 2
processing is then performed on the stage 1 results. Once
row-wise processing of this tile is complete, the result is
passed on for column processing, which in turn runs in
parallel with row processing of the following tile. This
gives high-level pipelining between row and column
processing.

The above operations are repeated tile-wise for the entire
image. Finally, the scaled coefficients of the
lowest-frequency sub-band (LL) of each tile are assembled to
form a 2D array, the level 1 compressed image. The same
routine is applied again to this lowest-frequency sub-band
to obtain the level 2 compressed image.
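
The tile-wise read-out and the alternation between bank 1
and bank 2 can be illustrated with a short software sketch.
The function name split_into_tiles is hypothetical, and two
details are assumptions rather than statements from the
text: tiles are visited in row-major order, and the banks
keep alternating (1, 2, 1, 2, ...) beyond the second tile.

```python
def split_into_tiles(image, tile_size):
    """Yield (tile_index, bank, tile) for each square tile of the image.

    Tiles are visited in row-major order (assumed). The bank alternates
    between 1 and 2 so that one tile can be written while the previous
    tile is still being processed from the other bank.
    """
    height, width = len(image), len(image[0])
    if height % tile_size or width % tile_size:
        raise ValueError("image dimensions must be multiples of tile_size")

    index = 0
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            tile = [row[left:left + tile_size]
                    for row in image[top:top + tile_size]]
            bank = 1 if index % 2 == 0 else 2
            yield index, bank, tile
            index += 1
```

For a 512x512 image and tile_size = 256 this yields four
tiles, matching the tile count used in the timing discussion
later in the paper.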

B. Stage 1 and Stage 2 DWT Processor

This module transforms the image, generating the detail
(high-pass) and scaled (low-pass) coefficients. The stage 1
DWT processor consists of registers, adders and multipliers,
as shown in Figure 4, and the stage 2 DWT processor consists
of registers, adders, multipliers and multiplexers, as shown
in Figure 5. The input data are 16 bits wide. The input data
from the tile memory are taken row-wise, divided into even
and odd data, and stored in the even and odd memories, which
supply the input to the stage 1 processor. The outputs of
the stage 1 DWT processor are stored in the first scaled and
detail memories. They are then given as input to the stage 2
DWT processor, whose outputs are stored row-wise in the
second scaled and detail memories. The outputs of both the
stage 1 and stage 2 DWT processors are scaled and detail
coefficients.

For the stage 1 DWT processor the pixels of the image are
the input. Data from the even and odd memories are used to
generate the scaled and detail coefficients. First, the
current even pixel value and the next even pixel value are
added, and the sum is multiplied by the filter coefficient.
The detail coefficient is obtained by adding this product to
the odd pixel value. Next, the detail coefficient is added
to the previous detail coefficient, and the sum is
multiplied by the filter coefficient. The result is added to
the even pixel value, which gives the scaled coefficient.
This process is repeated over all values in the even and odd
memories to obtain the scaled and detail coefficients of the
entire row.
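
The add, multiply, add sequence described above can be
sketched with integer arithmetic. This is an illustrative
model, not the actual datapath: it assumes that stage 1
corresponds to the first predict and update steps with
coefficients α and β, that products are truncated by an
arithmetic right shift of 14 bits, and that missing
neighbours at the row ends are mirrored. The names q_mul and
stage1_row are hypothetical; the coefficient words are the
Q[2:14] values listed in Table I.

```python
FRACTION_BITS = 14
ALPHA_Q = -25987   # alpha in Q[2:14], from Table I
BETA_Q = -868      # beta in Q[2:14], from Table I


def q_mul(sample, coeff_q):
    """Multiply an integer sample by a Q[2:14] coefficient.

    The arithmetic right shift drops the 14 fractional bits
    (rounding toward minus infinity), which is an assumed rounding mode.
    """
    return (sample * coeff_q) >> FRACTION_BITS


def stage1_row(even, odd):
    """First predict/update pair on one row of integer samples.

    Returns (scaled, detail) as lists of integers.
    """
    n = len(even)

    detail = []
    for i in range(n):
        next_even = even[min(i + 1, n - 1)]   # mirrored at the right edge (assumed)
        # (current even + next even) * coefficient, then add the odd sample
        detail.append(odd[i] + q_mul(even[i] + next_even, ALPHA_Q))

    scaled = []
    for i in range(n):
        prev_detail = detail[max(i - 1, 0)]   # mirrored at the left edge (assumed)
        # (previous detail + current detail) * coefficient, then add the even sample
        scaled.append(even[i] + q_mul(prev_detail + detail[i], BETA_Q))

    return scaled, detail
```

With this rounding choice the integer results stay within
about one to two units of the floating-point lifting values
for 16-bit-range inputs, which is the kind of error a
software model can be used to check before synthesis.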

These scaled and detail coefficients are the input to the
stage 2 DWT processor, which produces a second set of scaled
and detail coefficients from the coefficients transformed by
the stage 1 DWT processor. The stage 2 DWT processor
performs the same process as the stage 1 DWT processor, but
its input data are taken row-wise from the first scaled and
detail memories, and the resulting scaled and detail
coefficients are finally multiplied by the filter gain for
normalization.




Figure 4. Stage 1 2D DWT processor

Figure 5. Stage 2 2D DWT processor

The lifting coefficients are stored in fixed-point Q[2:14]
format. Table I shows the fixed-point representation of the
CDF (9,7) lifting coefficients.

Table I. Fixed point representation of lifting coefficients 



Coefficient   Original value    Q[2:14] representation
α             -1.586134342      -25987
β             -0.0529801185     -868
γ              0.882911076      14465
δ              0.443506852      7266
K              1.149604398      18835
1/K            0.869864452      14252
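
The Q[2:14] words in Table I follow from scaling each
coefficient by 2^14. The sketch below shows the conversion
under the assumption of round-to-nearest; the helper names
to_q2_14 and from_q2_14 are hypothetical. Rounding reproduces
the tabulated words for α, β, δ, K and 1/K. For γ it gives
14466, one least-significant bit above the tabulated 14465,
which suggests that entry was truncated rather than rounded.

```python
FRACTION_BITS = 14          # 14 fractional bits
WORD_MIN = -(1 << 15)       # 16-bit two's complement range:
WORD_MAX = (1 << 15) - 1    # 2 integer bits (incl. sign) + 14 fraction bits


def to_q2_14(value):
    """Convert a real coefficient to a Q[2:14] integer word (round to nearest)."""
    word = round(value * (1 << FRACTION_BITS))
    if not WORD_MIN <= word <= WORD_MAX:
        raise OverflowError("value does not fit in Q[2:14]")
    return word


def from_q2_14(word):
    """Convert a Q[2:14] integer word back to a real number."""
    return word / (1 << FRACTION_BITS)
```

For example, from_q2_14(to_q2_14(1.149604398)) differs from
the original K by less than one quantization step of 2^-14.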



C. Highly Parallel Pipelined Image Processing 

When a whole image is to be processed, its pixel values are
divided into N tiles of size 256x256. These tiles are
processed one after the other. Each tile is processed in two
steps: row-wise processing first, then column-wise
processing. Within both the row-wise and the column-wise
processing, stage 1 and stage 2 processing is done row by
row. If processing of the second tile started only after the
first tile had been completely processed, more clock cycles
would be consumed. Highly parallel pipelined processing is
applied here to solve this problem. While column processing
of the first tile is under way, row processing of the next
tile is done in parallel; in the next step, column
processing of that tile is overlapped with row processing of
the tile after it. Stage 1 and stage 2 processing within
each tile is pipelined in the same way: while stage 2
processing of the first row is under way, stage 1 processing
of the next row is done in parallel, and in the next step
stage 2 processing of that row is overlapped with stage 1
processing of the row after it. The processor fetches and
processes the row values of tile N+1 while it processes the
column values of tile N, so column processing of tile N and
row processing of tile N+1 complete at the same time.
Without pipelining, N tiles require 2N steps for complete
image processing. With pipelining this is reduced to N+1
steps, so the whole image can be processed in much less
time.
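
The 2N versus N+1 step count can be made concrete with a
small scheduling sketch. It only models which tile is in row
processing and which is in column processing at each step,
assuming both take one step of equal length; the function
name pipelined_schedule is hypothetical.

```python
def pipelined_schedule(num_tiles):
    """Return one (row_tile, col_tile) pair per pipeline step.

    Tiles are numbered from 1. None marks an idle unit, which happens
    only in the first step (no column work yet) and the last step
    (no row work left).
    """
    steps = []
    for step in range(num_tiles + 1):
        row_tile = step + 1 if step < num_tiles else None  # rows of the next tile
        col_tile = step if step >= 1 else None             # columns of the previous tile
        steps.append((row_tile, col_tile))
    return steps


def sequential_step_count(num_tiles):
    """Steps needed when row and column passes never overlap."""
    return 2 * num_tiles
```

For four tiles, pipelined_schedule(4) has five entries,
starting with (1, None) and ending with (None, 4), whereas
the sequential count is eight.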

IV. Implementation Results 

A software model of the proposed 2D DWT was developed,
implemented and simulated using MATLAB. Based on this
software model, an efficient, highly pipelined hardware
architecture for the 2D DWT processor was developed in the
Verilog hardware description language and synthesized for
the Virtex-6 (XC6VHX380T) FPGA and for ASIC technologies.
The proposed architecture is among the fastest
implementations because its highly parallel pipelined
processing limits the critical path delay. The proposed
system uses a large number of memories because of the highly
pipelined processing. Compared with similar architectures,
the computation time of the proposed architecture is reduced
with a controlled increase in hardware cost. ModelSim was
used to simulate the design and check its functionality.
Once functional verification was done, the design was taken
to the Xilinx tool for synthesis. The DWT schematic with its
basic inputs and outputs is shown in Figure 6.

Figure 6. DWT schematic with basic inputs and outputs
D. Performance Comparison 

Table II shows the performance of the proposed system, in
terms of processing speed, on a 512x512 image.

Table II. Processing speed of the DWT system

Clock period: 6.02 ns (frequency = 166 MHz)

Process                                                   Time taken (ns)
Tile 1 - Row 1 - Stage 1 processing                       616
Tile 1 - Row 1 - Stage 2 processing and
  Tile 1 - Row 2 - Stage 1 processing                     616
Tile 1 - Row processing                                   158312
Tile 1 - Column processing and Tile 2 - Row processing    158312
Total time for processing 512x512 image                   791560



A 512x512 image has four 256x256 tiles. The time taken to
process one tile is 0.16 ms. With highly parallel pipelined
processing, the total time is therefore 5 (4+1) times the
processing time for one tile, which is 0.79 ms. The proposed
highly parallel pipelined architecture thus helps to achieve
better performance for high-resolution images such as HD
(1920x1080) and 720p (1280x720). The performance comparison
of several 2D DWT architectures for CDF (9,7) with the
proposed architecture is shown in Table III.
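
The timing figures in Table II can be checked with the N+1
rule. The sketch below uses only numbers from Table II; the
helper name pipelined_time_ns is hypothetical. Reading the
per-tile figure as (256 + 1) slots of 616 ns, i.e. the same
N+1 rule applied to the 256 rows of a tile, is an inference
from the numbers rather than something the text states.

```python
STAGE_SLOT_NS = 616     # one stage-1/stage-2 slot, from Table II
ROWS_PER_TILE = 256     # rows in a 256x256 tile
TILES_PER_IMAGE = 4     # a 512x512 image holds four 256x256 tiles


def pipelined_time_ns(units, slot_ns):
    """Time for `units` items through a two-stage pipeline: (units + 1) slots."""
    return (units + 1) * slot_ns


# Row (or column) pass over one tile: 257 slots of 616 ns.
tile_pass_ns = pipelined_time_ns(ROWS_PER_TILE, STAGE_SLOT_NS)

# Whole image: 5 slots of one tile pass each.
image_ns = pipelined_time_ns(TILES_PER_IMAGE, tile_pass_ns)

# Same work with no overlap between row and column passes: 2N slots.
unpipelined_image_ns = 2 * TILES_PER_IMAGE * tile_pass_ns

print(tile_pass_ns, image_ns, unpipelined_image_ns)
```

Both pipelined values agree with the entries 158312 ns and
791560 ns in Table II, i.e. roughly 0.16 ms per tile pass and
0.79 ms per image.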

Table III. Comparison Of Several 2D DWT Architectures For CDF (9,7)

Architecture     Multiplier   Adder   DWT scheme
Chrysafis [13]   32           28      Convolution
Wu [11]          32           32      Convolution
Barua [14]       10           16      Lifting
Cheng [15]       24           76      FIR
Dillen [16]      16           24      Lifting
Proposed         10           16      Lifting



Table IV gives the design synthesis results for the FPGA and
Table V gives the design synthesis results for the ASIC
technology. The presented results were obtained for the
fixed-point [2:14] representation of the lifting
coefficients.

Table IV. Design Synthesis Results For FPGA Technology 



Technology            Xilinx Virtex-6
Area utilization:
  Slice LUTs          985
  Slice registers     506
  Memory              1027 kB
  DSP blocks          15
Maximum frequency     166 MHz



Table V. Design Synthesis Results For ASIC Technology 



Technology                       TSMC 0.18 µm
Combinational area utilization   21453
Sequential area utilization      5378
Gate count estimation            9.58 K
Maximum frequency                193 MHz



V. Conclusion And Future Work 

An efficient, highly pipelined VLSI architecture for the
bi-orthogonal CDF 9/7 2D DWT is proposed in this paper. The
architecture has been successfully implemented on a Xilinx
FPGA and in TSMC 0.18 µm technology. The proposed design is
suitable for a variety of hardware implementations that must
meet different processing speed requirements with a
controlled increase in hardware cost. The highly parallel
pipelined processing reduces the critical path delay. The
design works at an estimated frequency of 166 MHz on the
Xilinx Virtex-6 FPGA and 193 MHz with the TSMC 0.18 µm ASIC
library.